
gemm x86 support out_elemtype, multiheadattention and sdpa x86 support bf16 storage, skip mha bf16 tests #6623

Merged
nihui merged 27 commits into Tencent:master from nihui:sdpa-x86-bf16s
Mar 31, 2026

Conversation

@nihui (Member) commented Mar 30, 2026

No description provided.

@codecov-commenter commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.53%. Comparing base (18a7ad1) to head (4ed6121).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6623      +/-   ##
==========================================
+ Coverage   93.45%   93.53%   +0.08%     
==========================================
  Files         874      874              
  Lines      280098   281088     +990     
==========================================
+ Hits       261758   262921    +1163     
+ Misses      18340    18167     -173     

☔ View full report in Codecov by Sentry.

@tencent-adm (Member) commented Mar 30, 2026

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copilot AI (Contributor) left a comment

Pull request overview

This PR extends x86 compute paths to better support bf16 storage and Gemm output element type selection, and updates the test suite accordingly (including temporarily skipping MultiHeadAttention bf16 variants).

Changes:

  • Add output_elemtype handling to the x86 bf16 Gemm implementation so bf16 inputs can produce fp32 outputs.
  • Enable bf16 storage support flags for x86 MultiHeadAttention and SDPA, adjusting internal execution to accommodate bf16 storage.
  • Add a new Gemm test (test_gemm_5.cpp) and update test utilities to skip MultiHeadAttention bf16 testing.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/testutil.cpp Skips MultiHeadAttention bf16 tests; adds missing delete op on Vulkan skip paths (but early-return cleanup still incomplete).
tests/test_gemm_5.cpp New Gemm test covering output_elemtype=fp32 across shapes/transposes.
src/layer/x86/sdpa_x86.cpp Enables bf16 storage and updates intermediate/output allocations and memcpy sizes to respect bf16 elemsize.
src/layer/x86/multiheadattention_x86.cpp Enables bf16 storage; forces certain sublayers to fp32 and adds a bf16→fp32 cast for V before qkv gemm.
src/layer/x86/gemm_x86.cpp Threads output_elemtype through bf16 Gemm path and allocates/stores fp32 when requested.


Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 10 comments.




@nihui (Member, Author) commented Mar 31, 2026

  MultiHeadAttention (MHA)

  ┌─────────┬──────────────────────────────┬────────────────────────────────┐
  │ Threads │ bf16 (no AVX512BF16) vs fp32 │ bf16 (with AVX512BF16) vs fp32 │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       1 │      0.94x (slightly slower) │                          1.37x │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       2 │                        0.97x │                          1.31x │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       4 │                        0.97x │                          1.24x │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       8 │                        0.92x │                          1.14x │
  └─────────┴──────────────────────────────┴────────────────────────────────┘

  - bf16 with AVX512BF16 instructions shows a clear benefit on MHA, up to 1.37x speedup single-threaded
  - Small seqlen + large embed_dim benefits most (e.g. E=1024, S=128, 8 threads: fp32=622 → bf16+avx512bf16=791 GFLOPS)
  - bf16 without AVX512BF16 is actually slightly slower (~5-8%), because the bf16↔fp32 conversion overhead is not offset by any native bf16 compute acceleration

  SDPA

  ┌─────────┬──────────────────────────────┬────────────────────────────────┐
  │ Threads │ bf16 (no AVX512BF16) vs fp32 │ bf16 (with AVX512BF16) vs fp32 │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       1 │                        0.82x │                          0.99x │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       2 │                        0.71x │                          0.83x │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       4 │                        0.80x │                          0.93x │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       8 │                        0.75x │                          0.85x │
  └─────────┴──────────────────────────────┴────────────────────────────────┘

  - SDPA gains far less from bf16 than MHA; in most scenarios the geometric mean is below fp32
  - Reason: SDPA does only the attention computation (QK^T + softmax + QKV), which is relatively light on compute and memory-access heavy, so bf16's compute advantage does not show here
  - Only at large seqlen (≥256) does bf16+AVX512BF16 begin to approach or slightly exceed fp32 (1.02x~1.13x)
  - Small seqlen (32~128) actually degrades badly under multithreading, likely because packing/conversion overhead dominates

  Key conclusions

  1. AVX512BF16 instructions significantly accelerate MHA (1.14x~1.37x), mainly benefiting the bf16 gemm of the four large Q/K/V/Out projections
  2. bf16 without AVX512BF16 brings essentially no positive benefit; it is recommended to enable bf16 only when the hardware supports AVX512BF16
  3. SDPA has limited headroom for bf16 optimization; its bottleneck is memory access, not compute

@nihui nihui merged commit 371bbad into Tencent:master Mar 31, 2026
106 of 109 checks passed